ML2 Final¶
Datasource¶
Source: Kaggle - Unsupervised Learning on Country Data
Clustering the Countries by using Unsupervised Learning for HELP International. Objective: to categorise the countries using socio-economic and health factors that determine their overall development.
About the organization: HELP International is an international humanitarian NGO committed to fighting poverty and providing people in underdeveloped countries with basic amenities and relief during disasters and natural calamities.
Problem Statement:¶
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries in the direst need of aid. Your job as a data scientist is to categorise the countries using socio-economic and health factors that determine overall development, and then suggest the countries the CEO should focus on most. World by Income and Region - The World Bank: we will categorize countries by income group to identify those that need the most help, and compare the end result against The World Bank's data. We aim to divide the countries into 4 groups:
- High Income
- Upper Middle Income
- Lower Middle Income
- Low Income
Step 1: Importing libraries and loading data¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from kneed import KneeLocator
warnings.filterwarnings('ignore') # Ignore warnings
df = pd.read_csv('Country-data.csv')
display(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
None
df.describe(include='all')
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 167 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| unique | 167 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 38.270060 | 41.108976 | 6.815689 | 46.890215 | 17144.688623 | 7.781832 | 70.555689 | 2.947964 | 12964.155689 |
| std | NaN | 40.328931 | 27.412010 | 2.746837 | 24.209589 | 19278.067698 | 10.570704 | 8.893172 | 1.513848 | 18328.704809 |
| min | NaN | 2.600000 | 0.109000 | 1.810000 | 0.065900 | 609.000000 | -4.210000 | 32.100000 | 1.150000 | 231.000000 |
| 25% | NaN | 8.250000 | 23.800000 | 4.920000 | 30.200000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
| 50% | NaN | 19.300000 | 35.000000 | 6.320000 | 43.300000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
| 75% | NaN | 62.100000 | 51.350000 | 8.600000 | 58.750000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
| max | NaN | 208.000000 | 200.000000 | 17.900000 | 174.000000 | 125000.000000 | 104.000000 | 82.800000 | 7.490000 | 105000.000000 |
df.head()
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
df.isna().sum()
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64
EDA¶
import seaborn as sns
import matplotlib.pyplot as plt
numeric_df = df.drop('country', axis=1)
for col in numeric_df.columns:
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
plt.suptitle(f'{col.upper()} Distribution & Outliers', fontsize=16, fontweight='bold')
# Boxplot
sns.boxplot(x=col, data=numeric_df, ax=axes[0], color='lightblue')
axes[0].set_title('Boxplot', fontsize=12)
# Distribution Plot
sns.histplot(numeric_df[col], kde=True, ax=axes[1], color='salmon')
axes[1].set_title('Distribution', fontsize=12)
# Add skewness info
skewness = numeric_df[col].skew()
axes[1].annotate(f'Skewness: {skewness:.2f}', xy=(0.7, 0.9), xycoords='axes fraction', fontsize=10,
bbox=dict(facecolor='white', alpha=0.7))
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
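The boxplots above flag points beyond the whiskers visually; the same rule can also be counted programmatically. A minimal sketch of the standard 1.5×IQR rule on a toy series (the helper name `iqr_outlier_count` is ours, not from this notebook; in practice you would apply it to each column of `numeric_df`):

```python
import pandas as pd

def iqr_outlier_count(s, k=1.5):
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR], i.e. beyond the whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# Toy series with one obvious outlier
print(iqr_outlier_count(pd.Series([1, 2, 2, 3, 3, 3, 4, 100])))  # 1
```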
# Calculate the correlation matrix
corr = numeric_df.corr()
# Set up the matplotlib figure
plt.figure(figsize=(10, 8))
# Create the heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
# Add title
plt.title('Correlation Heatmap of Country Statistics', fontsize=16)
# Show plot
plt.show()
EDA Takeaways¶
- The data is clean: no missing values and consistent types.
- No outliers that look like data errors, only long tails in the skewed features.
- Notable positive correlations:
  - income - gdpp: 0.90 => higher GDP per capita goes with higher income
  - total_fer - child_mort: 0.85 => higher fertility goes with higher child mortality
  - imports - exports: 0.85 => countries that import more also tend to export more
- Notable negative correlations:
  - life_expec - child_mort: -0.89 => higher life expectancy is associated with lower child mortality
  - total_fer - life_expec: -0.76 => higher life expectancy is associated with lower fertility
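The pairs above were read off the heatmap by eye; they can also be extracted by ranking the off-diagonal entries of the correlation matrix. A sketch on synthetic data (the helper `top_corr_pairs` is illustrative; on the notebook's data you would pass `numeric_df`):

```python
import pandas as pd
import numpy as np

def top_corr_pairs(df, n=3):
    """Return the n strongest off-diagonal correlations, by absolute value."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(n)

# Toy frame: 'b' tracks 'a' closely, 'c' moves opposite to 'a'
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'a': x,
                    'b': x + rng.normal(scale=0.1, size=200),
                    'c': -x + rng.normal(scale=0.5, size=200)})
top = top_corr_pairs(toy)
print(top)
```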
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)
normalized_df.head()
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.426485 | 0.049482 | 0.358608 | 0.257765 | 0.008047 | 0.126144 | 0.475345 | 0.736593 | 0.003073 |
| 1 | 0.068160 | 0.139531 | 0.294593 | 0.279037 | 0.074933 | 0.080399 | 0.871795 | 0.078864 | 0.036833 |
| 2 | 0.120253 | 0.191559 | 0.146675 | 0.180149 | 0.098809 | 0.187691 | 0.875740 | 0.274448 | 0.040365 |
| 3 | 0.566699 | 0.311125 | 0.064636 | 0.246266 | 0.042535 | 0.245911 | 0.552268 | 0.790221 | 0.031488 |
| 4 | 0.037488 | 0.227079 | 0.262275 | 0.338255 | 0.148652 | 0.052213 | 0.881657 | 0.154574 | 0.114242 |
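One property worth noting: `MinMaxScaler` maps every column onto [0, 1] and is invertible, so original units can be recovered when interpreting clusters later. A quick check on a toy two-column array (values loosely echo the child_mort/income ranges above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy (child_mort, income) rows loosely echoing the min/median/max above
X = np.array([[2.6, 609.0], [19.3, 9960.0], [208.0, 125000.0]])

scaler = MinMaxScaler()
Xs = scaler.fit_transform(X)
print(Xs.min(axis=0), Xs.max(axis=0))  # each column spans exactly [0, 1]

# The transform is invertible, so clusters can be described in original units
restored = scaler.inverse_transform(Xs)
print(np.allclose(restored, X))  # True
```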
PCA¶
- We perform PCA with 9 components to inspect the explained variance
- We can explain up to 97% of the variance with 6 PCs
- All 6 PCs are mutually uncorrelated, as intended
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Apply PCA with 9 components
pca = PCA(n_components=9)
pca_components = pca.fit_transform(normalized_df)
# Create a DataFrame for the PCA results
pca_df_9 = pd.DataFrame(data=pca_components, columns=[f'PC{i+1}' for i in range(9)])
pca_df_9['country'] = df['country'] # Optional: add country column back
# Explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_
plt.figure(figsize=(10, 6))
plt.plot(range(1, 10), explained_variance, marker='o', linestyle='--', color='b')
plt.title('Explained Variance by Each Principal Component', fontsize=16)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.xticks(range(1, 10))
plt.grid(True)
plt.show()
# Cumulative sum of explained variance
cumulative_variance = explained_variance.cumsum()
print(cumulative_variance)
plt.figure(figsize=(10, 6))
plt.plot(range(1, 10), cumulative_variance, marker='o', linestyle='--', color='green')
plt.title('Cumulative Explained Variance by Principal Components', fontsize=16)
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.xticks(range(1, 10))
plt.grid(True)
plt.show()
[0.55001227 0.6838601 0.80687063 0.9043611 0.94214073 0.97227732 0.98418166 0.99305958 1. ]
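The 6-component choice can be derived from this printout rather than eyeballed, for example by taking the smallest number of PCs whose cumulative variance reaches a threshold (the `KneeLocator` imported at the top offers a more automated elbow search). A sketch using the rounded values printed above, with a hypothetical 95% threshold:

```python
import numpy as np

# Cumulative explained variance, rounded from the printout above
cumvar = np.array([0.550, 0.684, 0.807, 0.904, 0.942, 0.972, 0.984, 0.993, 1.0])

# Smallest number of components whose cumulative variance reaches 95%
k = int(np.searchsorted(cumvar, 0.95)) + 1
print(k)  # 6 components cover ~97% of the variance
```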
from sklearn.decomposition import IncrementalPCA
# Initialize Incremental PCA
ipca = IncrementalPCA(n_components=6)
# Fit and transform
ipca_components = ipca.fit_transform(normalized_df)
# Create a DataFrame for IPCA results
ipca_df = pd.DataFrame(data=ipca_components, columns=[f'PC{i+1}' for i in range(6)])
ipca_df['country'] = df['country'] # Optional: Add back 'country'
ipca_df.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | country | |
|---|---|---|---|---|---|---|---|
| 0 | 0.599055 | 0.095593 | 0.157410 | 0.024592 | 0.044905 | -0.044183 | Afghanistan |
| 1 | -0.158467 | -0.212146 | -0.064079 | 0.061208 | -0.014380 | -0.013885 | Albania |
| 2 | -0.003676 | -0.135878 | -0.134090 | -0.133633 | 0.091475 | 0.024286 | Algeria |
| 3 | 0.650256 | 0.275948 | -0.142585 | -0.156112 | 0.082833 | 0.030830 | Angola |
| 4 | -0.200710 | -0.064655 | -0.100702 | 0.037951 | 0.035631 | -0.057256 | Antigua and Barbuda |
explained_variance_ipca = ipca.explained_variance_ratio_
# Plot explained variance
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(range(1, 7), explained_variance_ipca.cumsum(), marker='o', linestyle='--', color='purple')
plt.title('Cumulative Explained Variance by Incremental PCA Components', fontsize=16)
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.xticks(range(1, 7))
plt.grid(True)
plt.show()
import numpy as np
import seaborn as sns
final_pca = ipca.fit_transform(normalized_df)
# Calculate the correlation matrix of the PCA components
pc = np.transpose(final_pca) # Transpose to get components in rows
corrmat = np.corrcoef(pc)
# Plot heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corrmat, annot=True, fmt=".2f", linewidth=0.75, cmap="Blues")
plt.title('Correlation Matrix of PCA Components')
plt.show()
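The all-but-diagonal zeros in this heatmap are no coincidence: PCA scores are pairwise orthogonal with zero mean, so their sample correlations vanish up to floating-point error. A self-contained check on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Build deliberately correlated features by mixing independent ones
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Z = PCA(n_components=5).fit_transform(X)

corr = np.corrcoef(Z.T)
off_diag = corr[~np.eye(5, dtype=bool)]
print(np.abs(off_diag).max())  # essentially zero
```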
K-Means Clustering¶
- We run clustering across a range of n_clusters and PCA component counts
- The elbow suggests 3-4 clusters works best
Dividing into real groups¶
- We go with 4 clusters, map them to the 4 income groups, and compare against data from The World Bank.
- The result is mostly similar to The World Bank's groupings.
- Some countries (e.g. Mexico) cannot be compared because they are missing from the dataset.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
cluster_range = range(1, 9) # Clusters from 1 to 8
pca_components_range = range(2, 7) # PCA components from 2 to 6
# Loop through different numbers of PCA components and cluster counts,
# fitting K-means on each reduced representation
for n_components in pca_components_range:
    # Re-project the data down to 'n_components' components
    pca = PCA(n_components=n_components)
    pca_data = pca.fit_transform(final_pca)
    # Loop through different cluster sizes
    for n_clusters in cluster_range:
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        kmeans.fit(pca_data)
        # Cluster assignment for each data point at this setting
        labels = kmeans.labels_
        df_clusters = pd.DataFrame(pca_data, columns=[f'PC{k+1}' for k in range(pca_data.shape[1])])
        df_clusters['Cluster'] = labels
# Initialize list to store inertia values
inertia_values = []
# Loop through PCA components and clusters
for n_components in pca_components_range:
pca = PCA(n_components=n_components)
pca_data = pca.fit_transform(final_pca) # Apply PCA to the data
# For each cluster count
for n_clusters in cluster_range:
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(pca_data)
inertia_values.append((n_components, n_clusters, kmeans.inertia_))
# Create a DataFrame to visualize inertia for each combination of PCA components and clusters
inertia_df = pd.DataFrame(inertia_values, columns=['PCA Components', 'Clusters', 'Inertia'])
# Plot the inertia values
plt.figure(figsize=(10, 6))
sns.lineplot(x='Clusters', y='Inertia', hue='PCA Components', data=inertia_df, marker='o')
plt.title('Inertia vs Number of Clusters and PCA Components')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()
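Inertia always decreases as k grows, so the elbow read is somewhat subjective. The silhouette, Calinski-Harabasz, and Davies-Bouldin scores imported at the top (but not used above) give a complementary view. A sketch on synthetic blobs standing in for the PCA-reduced countries (on the real data, replace `X` with `final_pca`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Four well-separated blobs as a stand-in for the reduced country data
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)              # higher is better
    print(k, round(scores[k], 3),
          round(calinski_harabasz_score(X, labels), 1),  # higher is better
          round(davies_bouldin_score(X, labels), 3))     # lower is better

print(max(scores, key=scores.get))  # silhouette peaks at the true k = 4
```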
from sklearn.cluster import KMeans
import pandas as pd
# Perform K-means clustering with 4 clusters (identified above as a good choice)
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(final_pca)
# Get the cluster labels (the tier for each country)
country_clusters = kmeans.labels_
# Assuming the 'df' contains a 'country' column (which holds the names of the countries)
# Add the cluster labels (tiers) as a new column in the dataframe
df['Cluster_Tier'] = country_clusters
# Map the cluster labels to descriptive names (optional)
tier_mapping = {0: 'Tier 1', 1: 'Tier 2', 2: 'Tier 3', 3: 'Tier 4'}
df['Cluster_Tier_Descriptive'] = df['Cluster_Tier'].map(tier_mapping)
# Calculate the average GDP per capita for each tier (cluster)
avg_gdp_per_tier = df.groupby('Cluster_Tier_Descriptive')['gdpp'].mean().reset_index()
# Sort the tiers based on average GDP, highest to lowest
avg_gdp_per_tier_sorted = avg_gdp_per_tier.sort_values(by='gdpp', ascending=False)
# Assign income status based on sorted average GDP per capita:
# highest -> 'High Income', then 'Upper Middle', 'Lower Middle', lowest -> 'Low Income'
Income_Status_mapping = {
avg_gdp_per_tier_sorted.iloc[0]['Cluster_Tier_Descriptive']: 'High Income',
avg_gdp_per_tier_sorted.iloc[1]['Cluster_Tier_Descriptive']: 'Upper Middle Income',
avg_gdp_per_tier_sorted.iloc[2]['Cluster_Tier_Descriptive']: 'Lower Middle Income',
avg_gdp_per_tier_sorted.iloc[3]['Cluster_Tier_Descriptive']: 'Low Income'
}
# Map the 'Income_Status' column based on the sorted tiers
df['Income_Status'] = df['Cluster_Tier_Descriptive'].map(Income_Status_mapping)
# Display the countries grouped by their development status
tier_groups = df.groupby('Income_Status')['country'].apply(list).reset_index()
print(tier_groups)
print(df[['country', 'Cluster_Tier_Descriptive', 'Income_Status']].head(20))
Income_Status country
0 High Income [Australia, Austria, Belgium, Brunei, Canada, ...
1 Low Income [Afghanistan, Angola, Benin, Burkina Faso, Bur...
2 Lower Middle Income [Algeria, Bangladesh, Bhutan, Bolivia, Botswan...
3 Upper Middle Income [Albania, Antigua and Barbuda, Argentina, Arme...
country Cluster_Tier_Descriptive Income_Status
0 Afghanistan Tier 4 Low Income
1 Albania Tier 3 Upper Middle Income
2 Algeria Tier 1 Lower Middle Income
3 Angola Tier 4 Low Income
4 Antigua and Barbuda Tier 3 Upper Middle Income
5 Argentina Tier 3 Upper Middle Income
6 Armenia Tier 3 Upper Middle Income
7 Australia Tier 2 High Income
8 Austria Tier 2 High Income
9 Azerbaijan Tier 3 Upper Middle Income
10 Bahamas Tier 3 Upper Middle Income
11 Bahrain Tier 3 Upper Middle Income
12 Bangladesh Tier 1 Lower Middle Income
13 Barbados Tier 3 Upper Middle Income
14 Belarus Tier 3 Upper Middle Income
15 Belgium Tier 2 High Income
16 Belize Tier 3 Upper Middle Income
17 Benin Tier 4 Low Income
18 Bhutan Tier 1 Lower Middle Income
19 Bolivia Tier 1 Lower Middle Income
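A quick sanity check on these labels is to profile each income group on a few raw features: the Low Income group should show high child mortality and low income/GDP. A sketch on a hypothetical mini-frame in the same shape as `df` (values are illustrative; in the notebook you would group the real `df`):

```python
import pandas as pd

# Hypothetical mini-frame mirroring df's columns (values are illustrative)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'child_mort': [90.2, 4.0, 27.3, 55.0],
    'income': [1610, 45000, 12900, 3200],
    'gdpp': [553, 48000, 4460, 1200],
    'Income_Status': ['Low Income', 'High Income',
                      'Upper Middle Income', 'Low Income'],
})

profile = toy.groupby('Income_Status')[['child_mort', 'income', 'gdpp']].mean()
print(profile)  # Low Income averages far higher child_mort, far lower gdpp
```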
fig = px.choropleth(df,
                    locationmode = 'country names',
                    locations = 'country',
                    color = 'Income_Status',
                    color_discrete_map = {'High Income': 'Green',
                                          'Upper Middle Income': 'LightGreen',
                                          'Lower Middle Income': 'Orange',
                                          'Low Income': 'Red'}
                    )
fig.update_layout(
margin = dict(
l=0,
r=0,
b=0,
t=0,
pad=2,
),
)
fig.show()
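To turn the Low Income cluster into a concrete recommendation for the CEO, the countries in it can be ranked by need, e.g. by child mortality (descending) with GDP per capita as a tiebreaker. A sketch on hypothetical rows (country names and values are illustrative; on the real data, filter `df` the same way):

```python
import pandas as pd

# Hypothetical Low Income rows (names and values are illustrative)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'child_mort': [47.0, 208.0, 150.0],
    'gdpp': [1200, 662, 897],
    'Income_Status': ['Low Income'] * 3,
})

priority = (toy[toy['Income_Status'] == 'Low Income']
            .sort_values(['child_mort', 'gdpp'], ascending=[False, True]))
print(priority['country'].tolist())  # ['B', 'C', 'A']: highest need first
```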
